Simple random selection of training and testing folds in the
structured environment leads to an underestimation of error in the
evaluation of spatial predictions and may result in inappropriate model
selection (Telford and Birks, 2009; Roberts et al., 2017). The use of
spatial and environmental blocks to separate training and testing sets
has been suggested as a good strategy for realistic error estimation in
datasets with dependence structures, and more generally as a robust
method for estimating the predictive performance of models used to
predict mapped distributions (Roberts et al., 2017). Package
blockCV provides functions to separate train and test sets
using buffers, spatial and clustering blocks
(spatial and environmental), i.e. generates folds for k-fold
and leave-one-out (LOO) cross-validation. It provides several
options for how those blocks are constructed. It also has a function
that applies geostatistical techniques to investigate the existing level
of spatial autocorrelation in the response or covariates to inform the
choice of a suitable distance band by which to separate the data sets.
The package has been written with species distribution modelling
(SDM) in mind, and the functions allow for a number of common
scenarios (including presence-absence and presence-background species
data, rare and common species, raster data for predictor variables).
Although it can be applied to any spatial modelling
e.g. multi-class responses for remote sensing image classification.
Please cite blockCV by: Valavi R, Elith J,
Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for
generating spatially or environmentally separated folds for k-fold
cross-validation of species distribution models. Methods Ecol Evol.
2019; 10:225–232. doi:
10.1111/2041-210X.13107
There are major updates in blockCV
v3.0. All function names have changed to more general
names starting with cv_*. The old functions (v2.x) still
work, but they will be removed in future versions. Please update your
code with the new functions presented bellow.
Some new updates:
cv_spatial,
cv_cluster, and cv_buffercv_spatial_autocor function now calculates spatial
autocorrelation range for either the response (i.e. the binary or
continuous data) or a set of continuous raster covariates (as
before)cv_plot function can be used to plot the folds
of all blocking strategy with ggplot facetsterra package is now used for all raster processing
with support for stars and raster objects (or
files on disk)cv_similarity provides measures of possible
extrapolation to testing foldsThe blockCV is available in CRAN and the latest update
can also be downloaded from GitHub. It is recommended to install the
dependencies of the package. To install the package use:
# install stable version from CRAN
install.packages("blockCV", dependencies = TRUE)
# install latest update from GitHub
remotes::install_github("rvalavi/blockCV", dependencies = TRUE)
# loading the package
library(blockCV)
The package contains the raw format of the following data:
.tif).csv)These data are used to illustrate how the package is used. The raster data include several bioclimatic variables for Australia. The species data include presence-absence records (binary) of a simulated species.
To load the package raster data use:
library(sf) # working with spatial vector data
library(terra) # working with spatial raster data
library(tmap) # plotting block and maps
# load raster data
# the pipe operator |> is available for R >= 4.1
rasters <- system.file("extdata/au/", package = "blockCV") |>
list.files(full.names = TRUE) |>
terra::rast()
The presence-absence species data include 481 presence
points and 514 absence points.
# load species presence-asence data and convert to sf
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
head(points)
## x y occ
## 1 1168236.6 -3824513 0
## 2 757436.3 -3927213 0
## 3 1767320.5 -3679021 1
## 4 1852903.9 -3251104 1
## 5 1005628.2 -4106938 1
## 6 1416428.5 -4260988 1
The appropriate format of species data for the blockCV
package is simple features (from the sf package). The data
is provide in GDA2020 / GA LCC
coordinate reference system with "EPSG:7845" as defined by
crs = 7845. We convert the data.frame to
sf as follows:
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
Let’s plot the species data using tmap package:
tm_shape(rasters[[1]]) +
tm_raster(legend.show = FALSE, n = 30, palette = gray.colors(10)) +
tm_shape(pa_data) +
tm_dots(col = "occ", style = "cat", size = 0.1)
The function cv_spatial creates spatial blocks/polygons
then assigns blocks to the training and testing folds with
random, checkerboard pattern or in a
systematic selection. Spatial blocks can be defined either by
size or number of rows and columns.
Consistent with other functions, the distance (size)
should be in metres, regardless of the unit of the
reference system of the input data. When the input map has
geographic coordinate system (decimal degrees), the block size
is calculated based on dividing size by 111325 (the
standard distance of a degree in metres, on the Equator) to change metre
to degree. This value can be changed by the user via the
deg_to_metre argument.
The offset can be used to shift the spatial position of
the blocks in horizontal and vertical axes, respectively. This only
works when the block have been built based on size. The
blocks argument allows users to define an external spatial polygon as
blocking layer.
Here are some spatial block settings:
sb1 <- cv_spatial(x = pa_data,
column = "occ", # the response column (binary or multi-class)
k = 5, # number of folds
size = 350000, # size of the blocks in metres
selection = "random", # random blocks-to-fold
iteration = 50, # find evenly dispersed folds
biomod2 = TRUE) # also create folds for biomod2
The same setting can be used to create square blocks. You can optionally add a raster layer for to be used for creating blocks and be used in the background of the plot.
sb2 <- cv_spatial(x = pa_data,
column = "occ",
r = rasters, # optional
k = 5,
size = 350000,
hexagon = FALSE,
selection = "random",
iteration = 50,
progress = FALSE,
biomod2 = TRUE)
##
## train_0 train_1 test_0 test_1
## 1 438 399 76 82
## 2 412 335 102 146
## 3 414 383 100 98
## 4 426 399 88 82
## 5 366 408 148 73
The assignment of folds to each block can also be done in a
systematic manner using selection = "systematic".
sb3 <- cv_spatial(x = pa_data,
column = "occ",
k = 5,
size = 350000,
hexagon = FALSE,
selection = "systematic",
iteration = 50)
##
## train_0 train_1 test_0 test_1
## 1 417 339 97 142
## 2 334 391 180 90
## 3 468 388 46 93
## 4 432 426 82 55
## 5 405 380 109 101
Or a checkerboard pattern using
selection = "checkerboard".
sb4 <- cv_spatial(x = pa_data,
column = "occ",
size = 350000,
hexagon = FALSE,
selection = "checkerboard",
iteration = 50)
##
## train_0 train_1 test_0 test_1
## 1 241 306 273 175
## 2 273 175 241 306
Let’s visualise the checkerboard blocks with tmap:
tm_shape(sb4$blocks) +
tm_fill(col = "folds", style = "cat")
The blocks can also be created by number of rows and columns:
sb5 <- cv_spatial(x = pa_data,
column = "occ",
r = rasters,
k = 5,
rows_cols = c(10, 2), # for hexagonal only first one is used
hexagon = FALSE,
selection = "random",
iteration = 50,
progress = FALSE)
##
## train_0 train_1 test_0 test_1
## 1 436 328 78 153
## 2 447 450 67 31
## 3 405 409 109 72
## 4 349 337 165 144
## 5 419 400 95 81
The function cv_cluster uses clustering methods
to specify sets of similar environmental conditions based on the input
covariates. Species data corresponding to any of these groups or
clusters are assigned to a fold. Alternatively, the clusters can be
based on spatial coordinates of sample points (the x
argument).
Using spatial coordinate values for clustering:
# spatial clustering
scv <- cv_cluster(x = pa_data,
column = "occ", # optional: counting number of train/test records
k = 5)
## Warning in cv_cluster(x = pa_data, column = "occ", k = 5): Fold 1 has class(es)
## with zero records
## train_0 train_1 test_0 test_1
## 1 490 481 24 0
## 2 414 456 100 25
## 3 318 293 196 188
## 4 378 244 136 237
## 5 456 450 58 31
The clustering can be done in environmental space by supplying
r. Notice, this could be an extreme case of
cross-validation as the testing folds could possibly fall in novel
environmental conditions than what the training points are (check
cv_similarity for testing this). Note that the input raster
layer should cover all the species points, otherwise an error will rise.
The records with no raster value should be deleted prior to the
analysis.
# environmental clustering
ecv <- cv_cluster(x = pa_data,
column = "occ",
r = rasters,
k = 5,
scale = TRUE)
## train_0 train_1 test_0 test_1
## 1 494 337 20 144
## 2 431 467 83 14
## 3 287 419 227 62
## 4 429 269 85 212
## 5 415 432 99 49
When r is supplied, all the input rasters are first
centred and scaled to avoid one raster variable dominate the
clusters.
By default, the clustering will be done based only on the values of
the predictors at the sample points. In this case, and the number of the
folds will be the same as k. If
raster_cluster = TRUE, the clustering is done in the raster
space. In this approach, the clusters will be consistent throughout the
region and across species (in the same region). However, this may result
in cluster(s) that cover none of the species records especially when
species data is not dispersed throughout the region (or environmental
ranges) or the number of clusters (k or folds) is high.
The function cv_buffer generates spatially separated
training and testing folds by considering buffers of specified distance
around each observation point. This approach is a form of leave-one-out
(LOO) cross-validation. Each fold is generated by excluding nearby
observations around each testing point within the specified distance
(ideally the range of spatial autocorrelation). In this method the test
set never directly abuts a training set.
Using buffering to create CV folds:
# buffering with presence-background data
bloo <- cv_buffer(x = pa_data,
column = "occ",
size = 400000)
For species presence-absence data and any other
types of data (such as continuous,
counts, and multi-class targets) keep
presence_background = FALSE. In this case, all sample
points other than the target point within the buffer are excluded, and
the training set comprises all points outside the buffer.
When using species presence-background data (or
presence and pseudo-absence), you need to supply the column
and set presence_background = TRUE. In this case, only
presence points are considered as target points. For more information
read the details section in the help of the function
(i.e. help(cv_buffer)).
You can visualise the generate folds for all block cross-validation
strategies. You can optionally add a raster layer for background maps
using r option. When r is supplied the plots
might be slightly slower.
Let’s plot spatial clustering folds created in previous section
(using cv_cluster):
cv_plot(cv = scv,
x = pa_data,
r = rasters) # optionally add a raster background
When cv_buffer is used for plotting, only first 10 folds
are shown. You can choose any set of CV folds for plotting. If
remove_na = FALSE (default is TRUE), the
NA in legend shows the excluded points.
cv_plot(cv = bloo,
x = pa_data,
num_plots = c(1, 50, 100))
If you do not supply x when plotting a
cv_spatial object, only the spatial blocks are plotted.
cv_plot(cv = sb1,
r = rasters,
raster_colors = terrain.colors(10, alpha = 0.5),
label_size = 4)
The cv_similarity function can check for environmental
similarity between the training and testing folds and thus possible
extrapolation. It computes multivariate environmental similarity surface
(MESS) as described in Elith et al. (2010). MESS represents how similar
a point in a testing fold is to a training fold (as a reference set of
points), with respect to a set of predictor variables in r.
The negative values are the sites where at least one variable has a
value that is outside the range of environments over the reference set,
so these are novel environments.
cv_similarity(cv = sb1,
x = pa_data,
r = rasters,
progress = FALSE)
This could be used to see the amount of di
cv_similarity(cv = ecv,
x = pa_data,
r = rasters,
progress = FALSE)
To support a first choice of block size, prior to any model fitting,
package blockCV includes the option for the user to look at
the existing autocorrelation in the response or predictors (as an
indication of landscape spatial structure). This tool does not suggest
any absolute solution to the problem, but serves as a guide to the user.
It provides information about the effective range of spatial
autocorrelation which is the range over which observations are
independent.
When only r is supplied, the
cv_spatial_autocor function works by automatically fitting
variograms to each continuous raster and finding the effective range of
spatial autocorrelation. Variogram is a fundamental geostatistical tool
for measuring spatial autocorrelation (O’Sullivan and Unwin, 2010).
sac1 <- cv_spatial_autocor(r = rasters,
num_sample = 5000)
The plotted block size is based on the median of the spatial
autocorrelation ranges. This could be as the minimum block
size for creating spatially separated folds. Variograms are
computed taking a number of random points (5000 as default)
from each input raster file. The variogram fitting procedure is done
using automap
package (Hiemstra et al., 2009), using the isotropic variogram and
assuming the data meets the geostatistical criteria
e.g. stationarity.
The output object of this function is an
cv_spatial_autocor object, an object of class S3.
# class of the output result
class(sac1)
## [1] "cv_spatial_autocor"
To see the details of the fitted variograms:
# summary statistics of the output
summary(sac1)
## Summary statistics of spatial autocorrelation ranges of all input layers:
## Length Class Mode
## 0 NULL NULL
## NULL
Alternatively, only use the response data using x and
column. This could be a binary or continuous variable
provided in as a column in the sample points sf object.
sac2 <- cv_spatial_autocor(x = pa_data,
column = "occ")
To visualise them (this needs the automap package to be
loaded):
library(automap)
## Loading required package: sp
plot(sac2$variograms[[1]])
Package blockCV also provides a visualisation tool for
assisting in block size selection. This tool is developed as local web
applications using R package shiny. With
cv_block_size, the user can choose among block sizes in a
specified range, visualise the resulting blocks interactively, viewing
the impact of block size on number and arrangement of blocks in the
landscape (and optionally on the distribution of sample points in those
blocks).
Using only raster data:
cv_block_size(r = rasters)
Or use only spatial sample data:
cv_block_size(x = pa_data,
column = "occ") # optionally add the response
Or add both raster and samples (also define a min/max size):
cv_block_size(x = pa_data,
column = "occ",
r = rasters,
min_size = 2e5,
max_size = 9e5)
Note that the interactive plots cannot be shown here, as they require
opening an external window or web browser. When using
cv_block_size, slide to the selected block size, and click
Apply Changes to change the block size.
Hiemstra, P. H., Pebesma, E. J., Twenhöfel, C. J., & Heuvelink, G. B. (2009). Real-time automatic interpolation of ambient gamma dose rates from the Dutch radioactivity monitoring network. Computers & Geosciences, 35(8), 1711–1721.
O’Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis (2nd ed.). John Wiley & Sons.
Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera-Arroita, G., Hauenstein, S., Lahoz-Monfort, J.J., Schröder, B., Thuiller, W., others, 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography. 40: 913-929.
Telford, R. J., & Birks, H. J. B. (2009). Evaluation of transfer functions in spatially structured environments. Quaternary Science Reviews, 28(13), 1309–1316.
Valavi R, Elith J, Lahoz-Monfort JJ, Guillera-Arroita G. blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol. 2019; 10:225–232. doi: 10.1111/2041-210X.13107